5  Missing values

Biodiversity datasets often contain missing values. However, analysis and visualisation software can differ in their default response to missing values. They might drop missing values, provide a warning about them, or return an error. We don’t always know which response we’ll get, which can cause problems for us (possibly without our knowing) during analysis later on. This is why it’s important to clean missing values effectively.

Understanding the nature of missing data can also be useful for informing how you proceed with analysis. For instance, understanding whether missing data are random or not can influence what you choose to do with them (e.g. remove vs. impute), which in turn can affect the results of statistical models based on those datasets. Identifying where missing data occur in a dataset is therefore an important part of exploratory data analysis. This chapter will show you how to detect and remove missing values.

5.0.1 Prerequisites

In this chapter, we will use gecko occurrence data since 2009 from the ALA using galah.

# packages
library(galah)
library(dplyr)
library(tidyr)

# data: Gecko records since 2009
galah_config(email = "your-email-here") # ALA Registered email

geckos <- galah_call() |>
  identify("Gekkonidae") |>
  filter(year >= 2009) |>
  select(group = "basic",
         kingdom, phylum, order, class, 
         family, genus, species, cl22, 
         eventDate, month) |>
  atlas_occurrences()

5.1 Find missing values

There are numerous ways to identify missing values in your data. Here we illustrate a few ways with very different output types.

From here we can return all rows with a missing variable.

geckos |> 
  filter(if_any(everything(), is.na)) |>
  select(scientificName, genus, 
         species, cl22, everything()) # reorder columns
# A tibble: 3,560 × 17
   scientificName    genus species cl22  recordID taxonConceptID decimalLatitude
   <chr>             <chr> <chr>   <chr> <chr>    <chr>                    <dbl>
 1 Gehyra            Gehy… <NA>    Sout… 001949e… https://biodi…          -28.1 
 2 Gehyra            Gehy… <NA>    Nort… 001ab80… https://biodi…          -16.1 
 3 Gehyra australis  Gehy… Gehyra… Nort… 002f70c… https://biodi…          -13.1 
 4 Lepidodactylus p… Lepi… Lepido… <NA>  00336d7… https://biodi…           -8.81
 5 Christinus        Chri… <NA>    Sout… 003e84f… https://biodi…          -35.0 
 6 Gehyra            Gehy… <NA>    Quee… 0046174… https://biodi…          -26.7 
 7 Gehyra nana       Gehy… Gehyra… <NA>  005ea26… https://biodi…           NA   
 8 GEKKONIDAE        <NA>  <NA>    Quee… 007172d… https://biodi…          -19.3 
 9 Gehyra            Gehy… <NA>    Nort… 0073628… https://biodi…          -22.2 
10 Christinus marmo… Chri… Christ… <NA>  007a1a3… https://biodi…          -34.9 
# ℹ 3,550 more rows
# ℹ 10 more variables: decimalLongitude <dbl>, eventDate <dttm>,
#   occurrenceStatus <chr>, dataResourceName <chr>, kingdom <chr>,
#   phylum <chr>, order <chr>, class <chr>, family <chr>, month <dbl>

Or we can specify the column to find rows with NA values one or more columns in particular (in this case column cl22).

geckos |> 
  filter(if_any(cl22, is.na)) |> 
  select(scientificName, genus, 
         species, cl22, everything()) # reorder columns
# A tibble: 1,253 × 17
   scientificName    genus species cl22  recordID taxonConceptID decimalLatitude
   <chr>             <chr> <chr>   <chr> <chr>    <chr>                    <dbl>
 1 Lepidodactylus p… Lepi… Lepido… <NA>  00336d7… https://biodi…           -8.81
 2 Gehyra nana       Gehy… Gehyra… <NA>  005ea26… https://biodi…           NA   
 3 Christinus marmo… Chri… Christ… <NA>  007a1a3… https://biodi…          -34.9 
 4 Gehyra punctata   Gehy… Gehyra… <NA>  007a329… https://biodi…           NA   
 5 Christinus marmo… Chri… Christ… <NA>  009be60… https://biodi…          -32.1 
 6 Cyrtodactylus     Cyrt… <NA>    <NA>  00a3522… https://biodi…           -5.73
 7 Hemidactylus fre… Hemi… Hemida… <NA>  00c6832… https://biodi…            1.29
 8 Cyrtodactylus     Cyrt… <NA>    <NA>  00fdca4… https://biodi…           -7.42
 9 Christinus marmo… Chri… Christ… <NA>  0143054… https://biodi…          -32.0 
10 Hemidactylus fre… Hemi… Hemida… <NA>  01aae23… https://biodi…           16.0 
# ℹ 1,243 more rows
# ℹ 10 more variables: decimalLongitude <dbl>, eventDate <dttm>,
#   occurrenceStatus <chr>, dataResourceName <chr>, kingdom <chr>,
#   phylum <chr>, order <chr>, class <chr>, family <chr>, month <dbl>

5.2 Remove NA values

5.2.1 Missing coordinates

If you are intending to make a map or run spatial analyses, it’s a good idea to exclude records with missing coordinates. Missing coordinate data often leads to one of the following: a function error, an undesirable map, or the (unintentional) exclusion of data points (many of which you might wish to include in other maps or analyses).

It’s good practice to tally and identify rows that have missing data before excluding them. We can use the same method as above to identify the records with missing coordinates.

geckos |> 
  filter(if_any(c(decimalLongitude, decimalLatitude), is.na)) |>
  count()
# A tibble: 1 × 1
      n
  <int>
1   153
geckos |> 
  filter(if_any(c(decimalLongitude, decimalLatitude), is.na)) |>
  select(scientificName, decimalLongitude, 
         decimalLatitude, everything()) # reorder columns
# A tibble: 153 × 17
   scientificName      decimalLongitude decimalLatitude recordID  taxonConceptID
   <chr>                          <dbl>           <dbl> <chr>     <chr>         
 1 Gehyra nana                       NA              NA 005ea268… https://biodi…
 2 Gehyra punctata                   NA              NA 007a3298… https://biodi…
 3 Heteronotia binoei                NA              NA 03a123ed… https://biodi…
 4 Gehyra australis                  NA              NA 05bef070… https://biodi…
 5 Gehyra occidentalis               NA              NA 0846001e… https://biodi…
 6 GEKKONIDAE                        NA              NA 0a078043… https://biodi…
 7 Gehyra xenopus                    NA              NA 0ad77ff9… https://biodi…
 8 Gehyra occidentalis               NA              NA 0c5390b2… https://biodi…
 9 Gehyra punctata                   NA              NA 11bbc940… https://biodi…
10 Gehyra nana                       NA              NA 12d9f935… https://biodi…
# ℹ 143 more rows
# ℹ 12 more variables: eventDate <dttm>, occurrenceStatus <chr>,
#   dataResourceName <chr>, kingdom <chr>, phylum <chr>, order <chr>,
#   class <chr>, family <chr>, genus <chr>, species <chr>, cl22 <chr>,
#   month <dbl>

If we decide we are happy to exclude these records, we can use drop_na() to remove missing values from our dataset.

geckos_filtered <- geckos |>
  drop_na(decimalLongitude, decimalLatitude)

We can check that drop_na() worked correctly by comparing the number of records in our initial data and our filtered data, specifically that geckos_filtered is smaller by the same number of NA values we found above.

nrow(geckos) - nrow(geckos_filtered)
[1] 153

5.3 Taxonomic values

5.3.1 Missing higher rank

Sometimes data can be missing information in columns with higher taxonomic rankings, even if observations have been taxonomically identified down to the species level. In this case, the goal isn’t only to remove the missing values, but to fill missing information with the correct information.

As an example, we’ll use a slightly modified geckos_missing dataset with some missing data add into the higher columns1. Below is a preview, and you’ll notice that there are NA values in the class and family columns as you skim across the pages.

Code
set.seed(87654) # for reproducibility

# randomly replace some row's class & family names with NAs
geckos_missing <- geckos |>
  mutate(class = replace(class, 
                         sample(row_number(), 
                                size = ceiling(0.15 * n())), 
                         NA),
         family = replace(family, 
                          sample(row_number(), 
                                 size = ceiling(0.15 * n())), 
                          NA))
geckos_missing |>
  select(scientificName, class, family, genus, species, everything())

If you have a list of taxonomic names with complete ranking information from your preferred taxonomic naming authority, you can use this information to back-fill your missing data columns. In our case, we can use names from ALA’s taxonomic backbone.

First we’ll get the complete list of Gekkonidae species taxonomic rank names using galah.

geckos_species_list <- galah_call() |>
  identify("gekkonidae") |>
  atlas_species()

To match the complete names in geckos_species_list with the missing names in geckos_missing, we can use a strategy of match-and-merge.

Using a reference column, we’ll merge our complete names data to our missing data. The information in the species_name column of geckos_species_list contains both the genus and species names; this is just like the scientificName column of our geckos_missing dataset2. We can use these columns as our reference columns.

geckos_species_list |> select(species_name) |> slice(10:15)
# A tibble: 6 × 1
  species_name           
  <chr>                  
1 Lepidodactylus lugubris
2 Gehyra montium         
3 Gehyra lazelli         
4 Nactus pelagicus       
5 Nactus eboracensis     
6 Gehyra punctata        
geckos_missing |> select(scientificName) |> slice(10:15)
# A tibble: 6 × 1
  scientificName         
  <chr>                  
1 Gehyra                 
2 Gehyra versicolor      
3 Gehyra variegata       
4 Hemidactylus frenatus  
5 Christinus marmoratus  
6 Lepidodactylus lugubris

Now we’ll select the subset of columns from geckos_species_list that we wish to join with geckos_missing (and our reference column, species_name).

list_subset <- geckos_species_list |>
  select(species_name, class, family)

We can use left_join() to merge list_subset to geckos_missing. Records that are identified at least down to the family level now have the correct rank information in the class_new and family_new columns 3.

geckos_missing |>
  left_join(list_subset, 
            by = join_by(scientificName == species_name), 
            suffix = c("_old", "_new")) |>
  select(scientificName, class_old, family_old, class_new, family_new) |>
  slice(20:30)
# A tibble: 11 × 5
   scientificName         class_old family_old class_new family_new
   <chr>                  <chr>     <chr>      <chr>     <chr>     
 1 Gehyra australis       Reptilia  <NA>       Reptilia  Gekkonidae
 2 Lepidodactylus pumilus Reptilia  Gekkonidae Reptilia  Gekkonidae
 3 Heteronotia binoei     Reptilia  Gekkonidae Reptilia  Gekkonidae
 4 Christinus marmoratus  Reptilia  Gekkonidae Reptilia  Gekkonidae
 5 Christinus             Reptilia  Gekkonidae <NA>      <NA>      
 6 Heteronotia binoei     Reptilia  Gekkonidae Reptilia  Gekkonidae
 7 Nactus eboracensis     Reptilia  Gekkonidae Reptilia  Gekkonidae
 8 Heteronotia binoei     Reptilia  Gekkonidae Reptilia  Gekkonidae
 9 Gehyra                 Reptilia  Gekkonidae <NA>      <NA>      
10 Heteronotia binoei     <NA>      <NA>       Reptilia  Gekkonidae
11 Christinus marmoratus  Reptilia  Gekkonidae Reptilia  Gekkonidae

5.3.2 Insufficient taxonomic identification

If a record is not identified down to the necessary taxonomic level required for your analysis (e.g. down to species or sub-species level), then the record should be removed.

One handy tip you can use with the galah package is to add the column taxonRank. This column usefully shows the taxonomic level of names in scientificName4. taxonRank can be useful because we can filter to only include records down to a specific rank.

geckos_rank <- galah_call() |>
  identify("gekkonidae") |>
  filter(year == 2013) |>
  select(taxonRank, scientificName) |>
  atlas_occurrences()

geckos_rank |>
  filter(taxonRank == "species")
# A tibble: 1,819 × 2
   taxonRank scientificName         
   <chr>     <chr>                  
 1 species   Lepidodactylus lugubris
 2 species   Gehyra versicolor      
 3 species   Heteronotia binoei     
 4 species   Hemidactylus frenatus  
 5 species   Gehyra dubia           
 6 species   Heteronotia binoei     
 7 species   Heteronotia binoei     
 8 species   Heteronotia binoei     
 9 species   Heteronotia binoei     
10 species   Heteronotia binoei     
# ℹ 1,809 more rows

However, it is still possible to filter records using the tools above without using the taxonRank column. In this case we remove records not identified down to the genus.

geckos_filtered <- geckos |>
  drop_na(genus) |>
  select(scientificName, genus, everything()) # reorder columns
geckos_filtered

5.4 Summary

This chapter showed ways to find and remove different types of missing records from your dataset. Importantly, there might not be one single cleaned dataset missing all missing values. You might find that in the same workflow, you will clean your dataset in multiple ways to preserve rows that are missing some, but not all, information. For example, some records have complete taxonomic identification but lack spatial coordinates. Other records have spatial coordinates but lack taxonomic information. Depending on your intended type of analysis, you may have to vary how you clean your data for each analysis!

In the next chapter we will address working with strings (character sequences). Strings often involve correcting for things like typos or extra spacing that are difficult to pick up at a first glance.


  1. The Atlas of Living Australia matches taxonomic names when data is ingested from data providers. This process means that it’s very rare for a species to be missing higher taxonomic rank names. This might not be the case for other data sources, though!↩︎

  2. This column contains the name of the lowest taxonomic rank the occurrence is identified to.↩︎

  3. This hasn’t worked for every row, however, because scientificName contains the name of the lowest taxonomic rank the occurrence is identified to. Names like Christinus and Gehyra are genus names, so in these cases scientificName won’t match our reference species_name column!↩︎

  4. This column contains the name of the lowest taxonomic rank the occurrence is identified to.↩︎